open-source project
Automated Duplicate Bug Report Detection in Large Open Bug Repositories
Laney, Clare E., Barovic, Andrew, Moin, Armin
Many users and contributors of large open-source projects report software defects or enhancement requests (known as bug reports) to the issue-tracking systems. However, they sometimes report issues that have already been reported. First, they may not have time to do sufficient research on existing bug reports. Second, they may not possess the right expertise in that specific area to realize that an existing bug report is essentially elaborating on the same matter, perhaps with a different wording. In this paper, we propose a novel approach based on machine learning methods that can automatically detect duplicate bug reports in an open bug repository based on the textual data in the reports. We present six alternative methods: topic modeling, Gaussian Naive Bayes, deep learning, time-based organization, clustering, and summarization using a generative pre-trained transformer large language model. Additionally, we introduce a novel threshold-based approach for duplicate identification, in contrast to the conventional top-k selection method that has been widely used in the literature. Our approach demonstrates promising results across all the proposed methods, achieving accuracy rates ranging from the high 70s to the low 90s percent. We evaluated our methods on a public dataset of issues belonging to an Eclipse open-source project.
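The contrast between top-k selection and threshold-based identification can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words cosine similarity, the function names, the sample reports, and the 0.5 cutoff are all assumptions chosen for illustration, standing in for whichever of the six methods produces the similarity scores.

```python
import math
import re
from collections import Counter

# Hypothetical sample repository; IDs and texts are invented for illustration.
EXISTING = {
    "BUG-1": "Editor crashes when opening large XML file",
    "BUG-2": "Add dark theme to preferences dialog",
    "BUG-3": "Crash on opening a large XML file in the editor",
}

def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def rank(new_report, existing):
    """Score every existing report against the new one, best first."""
    query = bag_of_words(new_report)
    scored = [(cosine(query, bag_of_words(text)), rid) for rid, text in existing.items()]
    return sorted(scored, reverse=True)

def top_k(new_report, existing, k=2):
    """Conventional selection: always return the k most similar reports."""
    return [rid for _, rid in rank(new_report, existing)[:k]]

def above_threshold(new_report, existing, threshold=0.5):
    """Threshold selection: return only reports that are similar enough."""
    return [rid for score, rid in rank(new_report, existing) if score >= threshold]
```

With this setup, a genuinely new report such as "Toolbar icons are blurry on high resolution displays" still yields two candidates under `top_k`, whereas `above_threshold` returns an empty list, which is the practical difference the abstract points to.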
- North America > United States > Texas > Harris County > Spring (0.04)
- North America > United States > Colorado > El Paso County > Colorado Springs (0.04)
- North America > United States > Virginia > Fairfax County > Fairfax (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
QLPro: Automated Code Vulnerability Discovery via LLM and Static Code Analysis Integration
Hu, Junze, Jin, Xiangyu, Zeng, Yizhe, Liu, Yuling, Li, Yunpeng, Du, Dan, Xie, Kaiyu, Zhu, Hongsong
Code auditing, a method where security researchers review source code to identify vulnerabilities, has become increasingly impractical for large-scale open-source projects. While Large Language Models (LLMs) demonstrate impressive code generation capabilities, they are constrained by limitations in context window size, memory capacity, and complex reasoning abilities, making direct vulnerability detection across entire projects infeasible. Static code analysis tools, though effective to a degree, are heavily reliant on their predefined scanning rules. To address these challenges, we present QLPro, a vulnerability detection framework that systematically integrates LLMs with static code analysis tools. QLPro introduces both a triple-voting mechanism and a three-role mechanism to enable fully automated vulnerability detection across entire open-source projects without human intervention. Specifically, QLPro first utilizes static analysis tools to extract all taint specifications from a project, then employs LLMs and the triple-voting mechanism to classify and match these taint specifications, thereby enhancing both the accuracy and appropriateness of taint specification classification.
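The abstract does not spell out how the triple-voting mechanism works. One plausible reading, sketched below purely as an assumption, is that each taint specification is classified three times and a label is accepted only when a majority of the votes agree. The function names, the label strings, and the stand-in classifier are hypothetical and are not taken from QLPro.

```python
from collections import Counter
from typing import Callable, Iterable, Optional

def triple_vote(classify: Callable[[str], str], spec: str, votes: int = 3) -> Optional[str]:
    """Ask the classifier `votes` times and keep the strict-majority label.

    Returns None when no label wins more than half of the votes, so the
    caller can treat the specification as unresolved.
    """
    labels = [classify(spec) for _ in range(votes)]
    label, count = Counter(labels).most_common(1)[0]
    return label if count > votes // 2 else None

def scripted_classifier(answers: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM call: replays fixed answers in order."""
    remaining = iter(answers)
    return lambda spec: next(remaining)
```

In a real pipeline `classify` would wrap an LLM request; the scripted version only makes the voting logic easy to exercise, for example `triple_vote(scripted_classifier(["source", "sink", "source"]), "readLine()")`.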
The Evolution of Darija Open Dataset: Introducing Version 2
Outchakoucht, Aissam, Es-Samaali, Hamza
Darija Open Dataset (DODa) represents an open-source project aimed at enhancing Natural Language Processing capabilities for the Moroccan dialect, Darija. With approximately 100,000 entries, DODa stands as the largest collaborative project of its kind for Darija-English translation. The dataset features semantic and syntactic categorizations, variations in spelling, verb conjugations across multiple tenses, as well as tens of thousands of translated sentences. The dataset includes entries written in both Latin and Arabic alphabets, reflecting the linguistic variations and preferences found in different sources and applications. The availability of such a dataset is critical for developing applications that can accurately understand and generate Darija, thus supporting the linguistic needs of the Moroccan community and potentially extending to similar dialects in neighboring regions. This paper explores the strategic importance of DODa, its current achievements, and the envisioned future enhancements that will continue to promote its use and expansion in the global NLP landscape.
Understanding the Helpfulness of Stale Bot for Pull-based Development: An Empirical Study of 20 Large Open-Source Projects
Khatoonabadi, SayedHassan, Costa, Diego Elias, Mujahid, Suhaib, Shihab, Emad
Pull Requests (PRs) that are neither progressed nor resolved clutter the list of PRs, making it difficult for the maintainers to manage and prioritize unresolved PRs. To automatically track, follow up, and close such inactive PRs, Stale bot was introduced by GitHub. Despite its increasing adoption, there are ongoing debates on whether using Stale bot alleviates or exacerbates the problem of inactive PRs. To better understand if and how Stale bot helps projects in their pull-based development workflow, we perform an empirical study of 20 large and popular open-source projects. We find that Stale bot can help deal with a backlog of unresolved PRs as the projects closed more PRs within the first few months of adoption. Moreover, Stale bot can help improve the efficiency of the PR review process as the projects reviewed PRs that ended up merged and resolved PRs that ended up closed faster after the adoption. However, Stale bot can also negatively affect the contributors as the projects experienced a considerable decrease in their number of active contributors after the adoption. Therefore, relying solely on Stale bot to deal with inactive PRs may lead to decreased community engagement and an increased probability of contributor abandonment.
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations
Tony, Catherine, Mutas, Markus, Ferreyra, Nicolás E. Díaz, Scandariato, Riccardo
Large Language Models (LLMs) like Codex are powerful tools for performing code completion and code generation tasks as they are trained on billions of lines of code from publicly available sources. Moreover, these models are capable of generating code snippets from Natural Language (NL) descriptions by learning languages and programming practices from public GitHub repositories. Although LLMs promise an effortless NL-driven deployment of software applications, the security of the code they generate has not been extensively investigated nor documented. In this work, we present LLMSecEval, a dataset containing 150 NL prompts that can be leveraged for assessing the security performance of such models. Such prompts are NL descriptions of code snippets prone to various security vulnerabilities listed in MITRE's Top 25 Common Weakness Enumeration (CWE) ranking. Each prompt in our dataset comes with a secure implementation example to facilitate comparative evaluations against code produced by LLMs. As a practical application, we show how LLMSecEval can be used for evaluating the security of snippets automatically generated from NL descriptions.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > Canada > Quebec > Montreal (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
Automatically Identifying Relations Between Self-Admitted Technical Debt Across Different Sources
Li, Yikun, Soliman, Mohamed, Avgeriou, Paris
Self-Admitted Technical Debt or SATD can be found in various sources, such as source code comments, commit messages, issue tracking systems, and pull requests. Previous research has established the existence of relations between SATD items in different sources; such relations can be useful for investigating and improving SATD management. However, there is currently a lack of approaches for automatically detecting these SATD relations. To address this, we proposed and evaluated approaches for automatically identifying SATD relations across different sources. Our findings show that our approach outperforms baseline approaches by a large margin, achieving an average F1-score of 0.829 in identifying relations between SATD items. Moreover, we explored the characteristics of SATD relations in 103 open-source projects and describe nine major cases in which related SATD is documented in a second source, and give a quantitative overview of 26 kinds of relations.
- South America > Uruguay > Maldonado > Maldonado (0.05)
- North America > United States > New Jersey > Hudson County > Hoboken (0.04)
- Europe > Netherlands (0.04)
- Asia > China (0.04)
best way to be a machine learning engineer
Becoming a machine learning engineer requires a combination of skills and knowledge in various areas such as mathematics, programming, data analysis, and machine learning algorithms.
Learn the basics of mathematics and statistics: Machine learning requires a strong foundation in mathematics and statistics. You should be familiar with calculus, linear algebra, probability, and statistics.
Master a programming language: You should learn a programming language such as Python or R, which are commonly used for machine learning. You should also be familiar with data structures, algorithms, and object-oriented programming.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.57)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.57)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.53)
Behavior Trees and State Machines in Robotics Applications
Ghzouli, Razan, Berger, Thorsten, Johnsen, Einar Broch, Wasowski, Andrzej, Dragule, Swaib
Autonomous robots combine skills to form increasingly complex behaviors, called missions. While skills are often programmed at a relatively low abstraction level, their coordination is architecturally separated and often expressed in higher-level languages or frameworks. State machines have been the go-to language to model behavior for decades, but recently, behavior trees have gained attention among roboticists. Although several implementations of behavior trees are in use, little is known about their usage and scope in the real world. How do concepts offered by behavior trees relate to traditional languages, such as state machines? How are concepts in behavior trees and state machines used in actual applications? This paper is a study of the key language concepts in behavior trees as realized in domain-specific languages (DSLs), internal and external DSLs offered as libraries, and their use in open-source robotic applications supported by the Robot Operating System (ROS). We analyze behavior-tree DSLs and compare them to the standard language for behavior models in robotics: state machines. We identify DSLs for both behavior-modeling languages, and we analyze five in-depth. We mine open-source repositories for robotic applications that use the analyzed DSLs and analyze their usage. We identify similarities between behavior trees and state machines in terms of language design and the concepts offered to accommodate the needs of the robotics domain. We observed that the usage of behavior-tree DSLs in open-source projects is increasing rapidly. We observed similar usage patterns at model structure and at code reuse in the behavior-tree and state-machine models within the mined open-source projects. We contribute all extracted models as a dataset, hoping to inspire the community to use and further develop behavior trees, associated tools, and analysis techniques.
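For readers unfamiliar with the two modeling styles being compared, the following minimal sketch expresses one small mission first as a behavior tree and then as a state machine. It does not reproduce any of the DSLs analyzed in the paper; the node classes, the mission, the state and event names, and the battery cutoff of 20 are all assumptions made for illustration.

```python
from enum import Enum

class Status(Enum):
    SUCCESS = 1
    FAILURE = 2
    RUNNING = 3

class Action:
    """Leaf node: runs a function on the shared world state."""
    def __init__(self, fn):
        self.fn = fn
    def tick(self, world):
        return self.fn(world)

class Sequence:
    """Composite: ticks children in order, stops at the first non-success."""
    def __init__(self, *children):
        self.children = children
    def tick(self, world):
        for child in self.children:
            status = child.tick(world)
            if status is not Status.SUCCESS:
                return status
        return Status.SUCCESS

class Fallback:
    """Composite: ticks children in order, stops at the first non-failure."""
    def __init__(self, *children):
        self.children = children
    def tick(self, world):
        for child in self.children:
            status = child.tick(world)
            if status is not Status.FAILURE:
                return status
        return Status.FAILURE

def battery_ok(world):
    return Status.SUCCESS if world["battery"] > 20 else Status.FAILURE

def go_to_goal(world):
    world["position"] = "goal"
    return Status.SUCCESS

def recharge(world):
    world["battery"] = 100
    return Status.SUCCESS

# Behavior tree: drive to the goal if the battery allows it, otherwise recharge.
MISSION = Fallback(Sequence(Action(battery_ok), Action(go_to_goal)), Action(recharge))

# The same coordination as a state machine: explicit states and event-driven transitions.
TRANSITIONS = {
    ("idle", "battery_ok"): "moving",
    ("idle", "battery_low"): "charging",
    ("moving", "arrived"): "idle",
    ("charging", "charged"): "idle",
}

def step(state, event):
    """Return the next state, staying in place on an unknown event."""
    return TRANSITIONS.get((state, event), state)
```

The tree encodes priorities through its structure and is re-evaluated from the root on every tick, while the state machine names every state and transition explicitly.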
- Europe > Sweden > Vaestra Goetaland > Gothenburg (0.04)
- Europe > Portugal > Lisbon > Lisbon (0.04)
- Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)
- Research Report (0.64)
- Instructional Material (0.46)
- Information Technology > Software (1.00)
- Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.67)
Infrastructure in '23
For the third year running, I set aside some time at the beginning of the year to share what I believe to be the most dynamic and important areas of innovation in infrastructure. If you share my interest in any one or more of these areas, I would love to hear from you.
The future of cloud is here, and it's JavaScript
While I have written previously about the rise of serverless computing, I was slow to appreciate the role JavaScript would play in pushing it forward. JavaScript is the only language that lives up to "write once, run anywhere." It has the most vibrant ecosystem of any language on the planet, unmatched startup times, and is secure enough to run untrusted code on behalf of users without modification or special tooling.
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Software (0.97)
Trust in Motion: Capturing Trust Ascendancy in Open-Source Projects using Hybrid AI
Sanchez, Huascar, Hitaj, Briland
Open-source is frequently described as a driver for unprecedented communication and collaboration, and the process works best when projects support teamwork. Yet, open-source cooperation processes in no way protect project contributors from considerations of trust, power, and influence. Indeed, achieving the level of trust necessary to contribute to a project and thus influence its direction is a constant process of change, and developers take many different routes over many communication channels to achieve it. We refer to this process of influence-seeking and trust-building as trust ascendancy. This paper describes a methodology for understanding the notion of trust ascendancy and introduces the capabilities that are needed to localize trust ascendancy operations happening over open-source projects. Much of the prior work in understanding trust in open-source software development has focused on a static view of the problem using different forms of quantity measures. However, trust ascendancy is not static, but rather adapts to changes in the open-source ecosystem in response to new input. This paper is the first attempt to articulate and study these signals from a dynamic view of the problem. In that respect, we identify related work that may help illuminate research challenges, implementation tradeoffs, and complementary solutions. Our preliminary results show the effectiveness of our method at capturing the trust ascendancy developed by individuals involved in a well-documented 2020 social engineering attack. Our future plans highlight research challenges and encourage cross-disciplinary collaboration to create more automated, accurate, and efficient ways to model and then track trust ascendancy in open-source projects.
- North America > United States > Minnesota (0.04)
- North America > United States > California > Santa Clara County > Santa Clara (0.04)
- Europe > Germany > Bavaria > Middle Franconia > Nuremberg (0.04)
- Information Technology > Security & Privacy (0.48)
- Government (0.47)
- Information Technology > Software (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (0.94)